options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("pacman")Take-home Exercise 1: Demographic structures and distribution of Singapore in 2024
1 Overview
A local online media company that publishes daily content on digital platforms is planning to release an article on demographic structures and distribution of Singapore in 2024.
2 Objective
Assuming the role of the graphical editor of the media company, you are tasked to prepare at most three data visualisations for the article.
3 Analytical Toolkit: RStudio
RStudio and Quarto are used as the primary analytical toolkit for this project. The data is processed using appropriate tidyverse family of packages and the data visualisation prepared using ggplot2 and its extensions.
Before we get started, it is important for us to ensure that the required R packages have been installed.
If you have yet to install pacman, install itby typing below in the Console:
We then load the following R packages using the pacman::p_load() function:
- tidyverse, a family of modern R packages specially designed to support data science, analysis and communication task including creating static statistical graphs.
- patchwork for combining multiple ggplot2 graphs into one figure.
- plotly, R library for plotting interactive statistical graphs.
- ggrepel: a R package provides geoms for ggplot2 to repel overlapping text labels.
- ggthemes: a R package provides some extra themes, geoms, and scales for ggplot.
- hrbrthemes: a R package provides typography-centric themes and theme components for ggplot2.
- qreport: Provides statistical components, tables, and graphs. - ggiraph: for making ‘ggplot’ graphics interactive.
pacman::p_load(tidyverse, patchwork,
plotly, ggrepel,
ggthemes, hrbrthemes, ggiraph, DT, qreport)4 Data Preparation
Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex, June 2024 dataset shared by Department of Statistics, Singapore(DOS)
4.1 Load the Data
First we load the data.
demographic_data <- read_csv("data/respopagesexfa2024.csv")
4.2 Check the Data
From the first glance, we notice that there are ‘0’ Pop in the dataset. For this exercise, we will be focusing on the top-level visualization per Planning Areas and granular details like Population per Subzone and Floor Area aren’t necessary. We should exclude those rows with zero population values at the Planning Area level. This will help clean up the data and make the visualizations clearer by removing unnecessary zeros. We will do it in Chapter 4.3
Here I am using qreportpackage’s dataOverview that I already pre-load earlier.
dataOverview(demographic_data, plot = c("none"),)demographic_data has 75696 observations (75696 complete) and 7 variables (7 complete)
|Variable |Type | Distinct| Info| Symmetry| NAs|Rarest Value | Frequency of Rarest Value|Mode | Frequency of Mode|
|:--------|:----------|--------:|-----:|--------:|---:|:-----------------------|-------------------------:|:-----------|-----------------:|
|PA |Nonnumeric | 55| 0.999| 0.991| 0|Central Water Catchment | 228|Bukit Merah | 3876|
|SZ |Nonnumeric | 332| 1.000| 1.000| 0|Admiralty | 228|Admiralty | 228|
|AG |Discrete | 19| 0.997| 1.000| 0|0_to_4 | 3984|0_to_4 | 3984|
|Sex |Discrete | 2| 0.750| 1.000| 0|Females | 37848|Females | 37848|
|FA |Discrete | 6| 0.972| 1.000| 0|<= 60 | 12616|<= 60 | 12616|
|Pop |Continuous | 183| 0.831| 6.953| 0|1260 | 1|0 | 41742|
|Time |Discrete | 1| 0.000| 1.000| 0|2024 | 75696|2024 | 75696|
Let’s also count what’s the total Pop
cntpop <- demographic_data %>%
summarise(Pop = sum(Pop, na.rm = TRUE))
cat(cntpop$Pop)4187720
Observation
The data shows Singapore Residents by Planning Area / Subzone, Single Year of Age and Sex as of June 2024 with total population of 4,187,720.
We observe that there are 75,696 rows and 7 columns. No missing values are observed. Refer to the column legend in Appendix A
There are a total of seven attributes. 5 of them are categorical data type and the other three are in numerical data type.
- The categorical attributes are: PA, SZ, AG, Sex, FA.
- The numerical attributes are: Pop, Time.
We can also observe how many distinct values for each Variable. This will help us think what to use for our visualization.
4.3 Data Preparation
demographic_data_clean <- demographic_data %>%
filter(Pop > 0)
DT::datatable(demographic_data_clean , options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)))5 Data Visualisation, Observation, and Insights
5.1 Top 10 Planning Areas (PA) Ranked by Size of Resident Population (Pop)

top10PA <- demographic_data_clean %>%
group_by(PA) %>%
summarise(Pop = sum(Pop, na.rm = TRUE)) %>%
slice_max(order_by = Pop, n = 10)
top10plot <- ggplot(data = top10PA,
aes(y = reorder(PA, Pop/1000), x = Pop/1000)) + # reorder PA by Pop
geom_col(show.legend = FALSE, fill = "pink4") +
geom_text(aes(label = (Pop/1000)),
hjust = -0.2, color = "black", size = 3) +
ggtitle("Top 10 Planning Areas in 2024\nRanked by Size of Singapore Resident Population",
subtitle = paste("Total resident population:",
format(round(cntpop$Pop / 1000, 2), big.mark = ","),
"thousand")) +
labs(
y = NULL,
x = "Resident Population\nin thousands (‘000)",
caption = "Source: singstat.gov.sg"
) +
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "",
axis_title_face = "bold",
axis_title_size = 11) +
theme(axis.text.x = element_blank(),
axis.text.y = element_text(size=11, face="bold"),
axis.title.x = element_text(hjust = 0.5)
)+
scale_x_continuous(expand = expansion(mult = c(0, 0.1)))
top10plot# Total population of the top 10 Planning Area
cntpoptop10 <- top10PA %>%
summarise(Pop = sum(Pop, na.rm = TRUE))
cat(cntpoptop10$Pop)2358550
# The percentage of the population of the top 10 most populous Planning Area relative to the total population.
perc_pop_top10 <- (cntpoptop10 / cntpop) * 100
cat(perc_pop_top10$Pop)56.32062
Insights Plot 1
Slightly over half (56.3%) of the 4,187.72 thousand (~4.19 million) residents in Singapore stayed in the top 10 planning areas of residence.
There were five planning areas with more than 250,000 residents each, namely Tampines, Bedok, Sengkang, Jurong West, and Woodlands.
Tampines was the most populated with 284,720 residents.
5.2 Age Distribution

AG_levels <- c(
"0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24",
"25_to_29", "30_to_34", "35_to_39", "40_to_44", "45_to_49",
"50_to_54", "55_to_59", "60_to_64", "65_to_69", "70_to_74",
"75_to_79", "80_to_84", "85_to_89", "90_and_over"
)
AGsum2 <- demographic_data_clean %>%
group_by(AG) %>%
summarise(Pop = sum(Pop, na.rm = TRUE), .groups = "drop") %>%
mutate(AG = factor(AG, levels = AG_levels)) %>%
arrange(AG) %>%
mutate(
Pop_share = Pop / sum(Pop),
cum_share = cumsum(Pop_share),
percentile = round(cum_share * 100, 1)
)
# Get the median group
median_AGsum2 <- AGsum2 %>%
filter(cumsum(Pop) >= sum(Pop) / 2) %>%
slice(1) %>%
pull(AG)
# Get the third quantile or 75 percentile
q3_AGsum2 <- AGsum2 %>%
filter(cumsum(Pop) >= sum(Pop) * 0.75) %>%
slice(1) %>%
pull(AG)
AGsumplot2 <- AGsum2 %>%
ggplot(
aes(y = Pop/1000, x = AG)) + # reorder AG_recode by Pop
geom_col(show.legend = FALSE, fill = "pink4") +
geom_text(aes(label = (Pop/1000)),
color = "black", size = 3, angle = 90, hjust = -0.2) +
ggtitle("Age Distribution of Singapore Resident Population",
subtitle = "in thousands (‘000)") +
annotate("segment",
x = median_AGsum2, xend = median_AGsum2,
y = 0, yend = 360,
color = "red",
linewidth = 0.7,
linetype = "dotted") +
annotate("text",
x = median_AGsum2,
y = 370,
label = "Median",
color = "red",
size = 2.8)+
annotate("segment",
x = q3_AGsum2, xend = q3_AGsum2,
y = 0, yend = 360,
color = "skyblue",
linewidth = 0.7,
linetype = "dotted") +
annotate(
geom = "text",
x = q3_AGsum2,
y = 370,
label = paste0("Q3"),
color="skyblue",
size = 2.8)+
labs(
y = NULL,
x = "Age Group",
caption = "Source: singstat.gov.sg") +
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "Y",
axis_title_face = "bold",
axis_title_size = 11) +
theme(axis.text.x = element_text(size=8, face="bold", angle = -45, hjust = 0),
axis.text.y = element_blank(), #element_text(size=11, face="bold"),
axis.title.x = element_text(size=11, face="bold", hjust = 0.5)
)+
scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
scale_x_discrete(labels=c("0_to_4" = "0-4",
"5_to_9" = "5-9",
"10_to_14" = "10-14",
"15_to_19" = "15-19",
"20_to_24" = "20-24",
"25_to_29" = "25-29",
"30_to_34" = "30-34",
"35_to_39" = "35-39",
"40_to_44" = "40-44",
"45_to_49" = "45-49",
"50_to_54" = "50-54",
"55_to_59" = "55-59",
"60_to_64" = "60-64",
"65_to_69" = "65-69",
"70_to_74" = "70-74",
"75_to_79" = "75-79",
"80_to_84" = "80-84",
"85_to_89" = "85-89",
"90_and_over" = "> 90"))
AGsumplot2# Details of the AG distribution
AGsum2# A tibble: 19 × 5
AG Pop Pop_share cum_share percentile
<fct> <dbl> <dbl> <dbl> <dbl>
1 0_to_4 170930 0.0408 0.0408 4.1
2 5_to_9 202420 0.0483 0.0892 8.9
3 10_to_14 204610 0.0489 0.138 13.8
4 15_to_19 211560 0.0505 0.189 18.9
5 20_to_24 225020 0.0537 0.242 24.2
6 25_to_29 270090 0.0645 0.307 30.7
7 30_to_34 321010 0.0767 0.383 38.3
8 35_to_39 315180 0.0753 0.459 45.9
9 40_to_44 310700 0.0742 0.533 53.3
10 45_to_49 301820 0.0721 0.605 60.5
11 50_to_54 307760 0.0735 0.678 67.8
12 55_to_59 294500 0.0703 0.749 74.9
13 60_to_64 297020 0.0709 0.820 82
14 65_to_69 266580 0.0637 0.883 88.3
15 70_to_74 206760 0.0494 0.933 93.3
16 75_to_79 134810 0.0322 0.965 96.5
17 80_to_84 77750 0.0186 0.983 98.3
18 85_to_89 44050 0.0105 0.994 99.4
19 90_and_over 25150 0.00601 1 100
Insights Plot 2
The median age falls within 40 to 44 age group with 25% of the population is aged 60 and above and 18% aged 65 and older. This indicates an aging population, a trend characterized by an increasing proportion of older individuals (typically defined as 65 years and over).
The youngest age groups (0–4, 5–9, 10–14) collectively account for only about 13.8% of the population, suggesting lower birth rates in recent years.
According to the Ministry of Manpower, the working-age population is defined as those aged 15 to 64, who make up approximately 68.2% of the total population. This reflects a strong labour force, though future demographic challenges may arise as this group continues to age.
5.3 Sex distribution by Age Group

pyramid_data <- demographic_data_clean %>%
group_by(AG, Sex) %>%
summarise(Population = (sum(Pop, na.rm = TRUE)/1000)) %>%
ungroup()
totpop_pyramid <- sum(pyramid_data$Population)
pyramid_data <- pyramid_data %>%
mutate(PopPercentage = ifelse(Sex == "Females",
-round(Population / totpop_pyramid * 100, 2),
round(Population / totpop_pyramid * 100, 2)),
Signal = ifelse(Sex == "Females", -1, 1))
pyramid_data$AG <- factor(pyramid_data$AG,
levels = c("0_to_4", "5_to_9", "10_to_14", "15_to_19",
"20_to_24", "25_to_29", "30_to_34", "35_to_39",
"40_to_44", "45_to_49", "50_to_54", "55_to_59",
"60_to_64", "65_to_69", "70_to_74", "75_to_79",
"80_to_84", "85_to_89", "90_and_over"),
labels = c("0-4", "5-9", "10-14", "15-19",
"20-24", "25-29", "30-34", "35-39",
"40-44", "45-49", "50-54", "55-59",
"60-64", "65-69", "70-74", "75-79",
"80-84", "85-89", ">90"),
ordered = TRUE)
pyramid_plot <- ggplot(pyramid_data, aes(x = AG, y = PopPercentage, fill = Sex)) +
geom_bar(stat = "identity") +
geom_text(aes(y = PopPercentage + Signal * 0.5, label = abs(PopPercentage)),
size = 3, color = "black") +
coord_flip() +
scale_fill_manual(values = c("Females" = "pink2", "Males" = "steelblue"),
guide = guide_legend(override.aes = list(fill = NA))) +
scale_y_continuous(labels = abs) +
ggtitle("Population Pyramid of Singapore Residents\nby Age and Sex 2024",
subtitle = "in percentage (%)") +
labs(
y = "Population (%)",
x = "Age Group",
fill = "Sex",
caption = "Source: singstat.gov.sg") +
theme_ipsum(base_family = "Arial",
plot_title_size = 14,
subtitle_size = 10,
caption_size = 8,
plot_title_face = "bold",
caption_face = "italic",
grid = "Y",
axis_title_face = "bold",
axis_title_size = 11,
axis_text_size = 8) +
theme(
strip.text = element_text(face = "bold"),
axis.title.x = element_text(hjust = 0.5),
axis.title.y = element_text(hjust = 0.5),
#axis.text.y = element_text(size = 8),
legend.position = "top",
legend.title = element_blank(),
legend.justification = c(0.45, 0),
legend.margin = margin(t = -20, r = 0, b = -10, l = 0, unit = "pt"))
pyramid_plot# Details of the population pyramid distribution
pyramid_data# A tibble: 38 × 5
AG Sex Population PopPercentage Signal
<ord> <chr> <dbl> <dbl> <dbl>
1 0-4 Females 83.4 -1.99 -1
2 0-4 Males 87.5 2.09 1
3 10-14 Females 100. -2.4 -1
4 10-14 Males 104. 2.49 1
5 15-19 Females 104. -2.49 -1
6 15-19 Males 107. 2.56 1
7 20-24 Females 110. -2.64 -1
8 20-24 Males 115. 2.74 1
9 25-29 Females 135. -3.22 -1
10 25-29 Males 135. 3.23 1
# ℹ 28 more rows
Insights Plot 3
In the youngest age group (0-4), there are slightly more males than females.
The 25-29 age group shows a near-equal population size for both genders.
The gender gap widens in the older age cohorts, with majority female in the 80+ age groups. This show that females live longer than males on average, consistent with the life expectancy at birth between the different gendersfrom 2023 report by the Singapore Department of Statistics.
6 Reference
- ggplot for categorical-data
- Describe function
- gt package
- theme for ggplot2
- Recode Values with dplyr
- Customize tick marks and labels
- National Statistical Standards Recommendations on Definition and Classification of Age
- Cencus of Population 2020
- Population Pyramid Plot
- Ageing Population
7 Appendix
7.1 Appendix A
| Column Name | Description |
|---|---|
| PA | Planning Area |
| SZ | Subzone |
| AG | Age Group |
| Sex | Sex |
| FA | Floor Area of Residence |
| Pop | Resident Count (Population) |
| Time | Time / Period |